{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Renumbering Test\n",
    "\n",
    "In this notebook, we will use the _renumber_ function to compute new vertex IDs.\n",
    "\n",
    "Under the covers, cuGraph represents a graph as a matrix in Compressed Sparse Row format (see https://en.wikipedia.org/wiki/Sparse_matrix).  The problem with a matrix representation is that there is a column and row for every possible vertex ID.  Therefore, if the data contains vertex IDs that are non-contiguious, or which start at a large initial value, then there is a lot of empty space that uses up memory.      \n",
    "\n",
    "An alternative case is using renumbering to convert from one data type down to a contiguious sequence of integer IDs.  This is useful when the dataset contain vertex IDs that are not integers.  \n",
    "\n",
    "\n",
    "Notebook Credits\n",
    "* Original Authors: Chuck Hastings and Bradley Rees\n",
    "* Created:   08/13/2019\n",
    "* Updated:   03/03/2020\n",
    "\n",
    "RAPIDS Versions: 0.13   \n",
    "\n",
    "Test Hardware\n",
    "\n",
    "* GV100 32G, CUDA 10.2\n",
    "\n",
    "## Introduction\n",
    "\n",
    "Demonstrate creating a graph with renumbering.\n",
    "\n",
    "Most cugraph algorithms operate on a CSR representation of a graph.  A CSR representation requires an indices array that is as long as the number of edges and an offsets array that is as 1 more than the largest vertex id.  This makes the memory utilization entirely dependent on the size of the largest vertex id.  For data sets that have a sparse range of vertex ids, the size of the CSR can be unnecessarily large.  It is easy to construct an example where the amount of memory required for the offsets array will exceed the amount of memory in the GPU (not to mention the performance cost of having a large number of offsets that are empty but still have to be read to be skipped).\n",
    "\n",
    "The cugraph renumbering feature allows us to take two columns of any integer type and translate them into a densely packed contiguous array numbered from 0 to (num_unique_values - 1).  These renumbered vertices can be used to create a graph much more efficiently.\n",
    "\n",
    "Another of the features of the renumbering function is that it can take vertex ids that are 64-bit values and map them down into a range that fits into 32-bit integers.  The current cugraph algorithms are limited to 32-bit signed integers as vertex ids. and the renumbering feature will allow the caller to translate ids that are 64-bit into a densly packed 32-bit array of ids that can be used in cugraph algorithms.  Note that if there are more than 2^31 - 1 unique vertex ids then the renumber method will fail with an error indicating that there are too many vertices to renumber into a 32-bit signed integer.\n",
    "\n",
    "Note that this version (0.10) is limited to integer types.  The intention is to extend the renumbering function to be able to handle strings and other types."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "First step is to import the needed libraries"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import cugraph\n",
    "import cudf\n",
    "import socket\n",
    "import struct\n",
    "import pandas as pd\n",
    "import numpy as np\n",
    "import networkx as nx\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Create some test data\n",
    "\n",
    "This creates a small circle using some ipv4 addresses, storing the columns in a GPU data frame.\n",
    "\n",
    "The current version of renumbering operates only on integer types, so we translate the ipv4 strings into 64 bit integers."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "source_list = [ '192.168.1.1', '172.217.5.238', '216.228.121.209', '192.16.31.23' ]\n",
    "dest_list = [ '172.217.5.238', '216.228.121.209', '192.16.31.23', '192.168.1.1' ]\n",
    "source_as_int = [ struct.unpack('!L', socket.inet_aton(x))[0] for x in source_list ]\n",
    "dest_as_int = [ struct.unpack('!L', socket.inet_aton(x))[0] for x in dest_list ]\n",
    "\n",
    "\n",
    "print(\"sources came from: \" + str([ socket.inet_ntoa(struct.pack('!L', x)) for x in source_as_int ]))\n",
    "print(\"  sources as int = \" + str(source_as_int))\n",
    "print(\"destinations came from: \" + str([ socket.inet_ntoa(struct.pack('!L', x)) for x in dest_as_int ]))\n",
    "print(\"  destinations as int = \" + str(dest_as_int))\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Create our GPU data frame"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df = pd.DataFrame({\n",
    "        'source_list': source_list,\n",
    "        'dest_list': dest_list,\n",
    "        'source_as_int': source_as_int,\n",
    "        'dest_as_int': dest_as_int\n",
    "        })\n",
    "\n",
    "gdf = cudf.DataFrame.from_pandas(df[['source_as_int', 'dest_as_int']])\n",
    "\n",
    "gdf.to_pandas()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Run renumbering\n",
    "\n",
    "The current version of renumbering takes a column of source vertex ids and a column of dest vertex ids.  As mentioned above, these must be integer columns.\n",
    "\n",
    "Output from renumbering is 3 cudf.Series structures representing the renumbered sources, the renumbered destinations and the numbering map which maps the new ids back to the original ids.\n",
    "\n",
    "In this case,\n",
    " * gdf['source_as_int'] is a column of type int64\n",
    " * gdf['dest_as_int'] is a column of type int64\n",
    " * src_r will be a series of type int32 (we translate back to 32-bit integers)\n",
    " * dst_r will be a series of type int32\n",
    " * numbering will be a series of type int64 that translates the elements of src and dst back to their original 64-bit values\n",
    " \n",
    "Note that because the renumbering translates us to 32-bit integers, if there are more than 2^31 - 1 unique 64-bit values in the source/dest passed into renumbering this would exceed the size of the 32-bit integers so you will get an error from the renumber call. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "src_r, dst_r, numbering = cugraph.renumber(gdf['source_as_int'], gdf['dest_as_int'])\n",
    "\n",
    "gdf.add_column(\"original id\", numbering)\n",
    "gdf.add_column(\"src_renumbered\", src_r)\n",
    "gdf.add_column(\"dst_renumbered\", dst_r)\n",
    "\n",
    "gdf.to_pandas()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Data types\n",
    "\n",
    "Just to confirm, the data types of the renumbered columns should be int32, the original data should be int64, the numbering map needs to be int64 since the values it contains map to the original int64 types."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "gdf.dtypes"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Quick verification\n",
    "\n",
    "To understand the renumbering, here's a block of verification logic.  In the renumbered series we created a new id for each unique value in the original series.  The numbering map identifies that mapping.  For any vertex id X in the new numbering, numbering[X] should refer to the original value."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "for i in range(len(src_r)):\n",
    "    print(\" \" + str(i) +\n",
    "          \": (\" + str(source_as_int[i]) + \",\" + str(dest_as_int[i]) +\")\"\n",
    "          \", renumbered: (\" + str(src_r[i]) + \",\" + str(dst_r[i]) +\")\"\n",
    "          \", translate back: (\" + str(numbering[src_r[i]]) + \",\" + str(numbering[dst_r[i]]) +\")\"\n",
    "         )\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Now let's do some graph things...\n",
    "\n",
    "To start, let's run page rank.  Not particularly interesting on our circle, since everything should have an equal rank."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "G = cugraph.Graph()\n",
    "gdf_r = cudf.DataFrame()\n",
    "gdf_r[\"src\"] = src_r\n",
    "gdf_r[\"dst\"] = dst_r\n",
    "G.from_cudf_edgelist(gdf_r, source='src', destination='dst')\n",
    "\n",
    "pr = cugraph.pagerank(G)\n",
    "\n",
    "pr.add_column(\"original id\", numbering)\n",
    "pr.to_pandas()\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Try to run jaccard\n",
    "\n",
    "Not at all an interesting result, but it demonstrates a more complicated case.  Jaccard returns a coefficient for each edge.  In order to show the original ids we need to add columns to the data frame for each column that contains one of renumbered vertices.  In this case, the columns source and destination contain renumbered vertex ids."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "jac = cugraph.jaccard(G)\n",
    "\n",
    "jac.add_column(\"original_source\",\n",
    "               [ socket.inet_ntoa(struct.pack('!L', numbering[x])) for x in jac['source'] ])\n",
    "\n",
    "jac.add_column(\"original_destination\",\n",
    "               [ socket.inet_ntoa(struct.pack('!L', numbering[x])) for x in jac['destination'] ])\n",
    "\n",
    "jac.to_pandas()\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "___\n",
    "Copyright (c) 2019-2020, NVIDIA CORPORATION.\n",
    "\n",
    "Licensed under the Apache License, Version 2.0 (the \"License\");  you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0\n",
    "\n",
    "Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.\n",
    "___"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "cugraph_dev",
   "language": "python",
   "name": "cugraph_dev"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
